- C.S.M.P. Digest Tue, 19 Dec 95 Volume 3 : Issue 128
- >From erichsen@pacificnet.net (Erichsen)
- Subject: Doubles Vs BlockMove
- Date: 16 Nov 1995 02:22:08 GMT
- Organization: Disorganized
-
- I did some tests (modifying the code in the MoveData app from Tricks of
- the Mac Game Programming Gurus) between using doubles in a loop and
- BlockMove in a loop, and BlockMove still blew it away (200 ticks vs.
- 146 ticks for BlockMove), so why don't more people use BlockMove?
-
- I compared BlockMove vs BlockMoveData and found no difference at all (both
- 146 ticks). Does BlockMove not flush the cache on a 6100?
-
- One of the replies to my previous question of why people don't just use
- BlockMove instead of a copying loop was that the data is not necessarily
- a block; but all the examples of blitters I've seen just copy one
- contiguous block of memory to another contiguous block of memory. Why
- couldn't BlockMove be used?
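-
- In C, the comparison being described boils down to something like this
- minimal sketch (illustrative only, not the MoveData code itself;
- TickCount() and BlockMoveData() are the usual Toolbox calls, and the
- buffer size and repeat count here are arbitrary assumptions):
-
-   #include <Events.h>    /* TickCount(): ticks are 60ths of a second */
-   #include <Memory.h>    /* BlockMoveData() */
-
-   #define BUF_BYTES (64L * 1024L)
-   #define REPS      200L
-
-   static void CompareCopies(double *src, double *dst)
-   {
-       long          n = BUF_BYTES / sizeof(double);
-       unsigned long doubleTicks, blockTicks, t;
-       long          r, i;
-
-       t = TickCount();
-       for (r = 0; r < REPS; r++)       /* copy loop using doubles */
-           for (i = 0; i < n; i++)
-               dst[i] = src[i];
-       doubleTicks = TickCount() - t;
-
-       t = TickCount();
-       for (r = 0; r < REPS; r++)       /* same copy via BlockMoveData */
-           BlockMoveData(src, dst, BUF_BYTES);
-       blockTicks = TickCount() - t;
-
-       /* compare doubleTicks vs. blockTicks */
-   }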
-
- +++++++++++++++++++++++++++
-
- >From cameron_esfahani@powertalk.apple.com (Cameron Esfahani)
- Date: Mon, 20 Nov 1995 11:55:46 -0800
- Organization: Apple Computer, Inc.
-
- BlockMove and BlockMoveData on the first-generation PPC are exactly the
- same function. The reason BlockMoveData was created in the first place
- was so you could tell the system you were not moving code around and it
- did not need to flush the instruction cache. Since the 601 has a
- unified cache, there is no cache coherency to worry about, so the
- processor cache doesn't have to be flushed.
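-
- In C terms the distinction is just which call you make. A minimal
- sketch (illustrative, not Apple sample code; prototypes as in the
- classic Memory.h interfaces):
-
-   #include <Memory.h>   /* BlockMove() / BlockMoveData(), Size */
-
-   static void CopyData(const void *src, void *dst, Size n)
-   {
-       /* Pure data: promises the system no code is being moved, so
-          no instruction-cache flush is needed on split-cache parts. */
-       BlockMoveData(src, dst, n);
-   }
-
-   static void CopyCode(const void *src, void *dst, Size n)
-   {
-       /* May be executable code: BlockMove keeps the instruction
-          cache coherent on machines where that matters. */
-       BlockMove(src, dst, n);
-   }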
-
- The reason most people don't use BlockMove/BlockMoveData as a blitter is
- that it will be very very slow if you ever use the screen as the
- destination. The reason is that the BlockMove/BlockMoveData routines use
- the PPC instruction DCBZ. This instruction will cause a data-exception
- fault if the address supplied is not copy-back cacheable. The screen
- isn't marked copy-back cacheable.
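-
- This is why blitters that may target VRAM roll their own loop. A
- bare-bones sketch of the idea (illustrative; a real blitter would
- unroll this and step by the screen's rowBytes):
-
-   #include <stddef.h>
-
-   /* Copy one contiguous run without BlockMove, so an uncacheable
-      destination is safe: no dcbz is ever issued. */
-   static void BlitRun(const unsigned long *src, unsigned long *dst,
-                       size_t words)
-   {
-       size_t i;
-       for (i = 0; i < words; i++)
-           dst[i] = src[i];
-   }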
-
- Hope this helps,
- Cameron Esfahani
-
- +++++++++++++++++++++++++++
-
- >From nporcino@sol.uvic.ca (Nick Porcino)
- Date: 20 Nov 1995 20:35:30 GMT
- Organization: Planet IX
-
- We did some tests and found on a Q700 that BlockMoveData was faster than
- BlockMove in the context of an actual game (Riddle of Master Lu).
-
- - Nick Porcino
- Lead Engine Guy
- Sanctuary Woods
-
- +++++++++++++++++++++++++++
-
- >From meggs@virginia.edu (Andrew Meggs)
- Date: Tue, 21 Nov 1995 02:55:08 GMT
- Organization: University of Virginia
-
- In article <erichsen-1511951722510001@pm2-3.pacificnet.net>,
- erichsen@pacificnet.net (Erichsen) wrote:
-
- > I did some tests (modifying the code in MoveData app from Tricks of the
- > Mac Game Programming Gurus) between using doubles in a loop and BlockMove
- > in a loop and BlockMove still blew it away (200 ticks vs 146 ticks for
- > BlockMove) so why don't more people use BlockMove?
- >
-
- This got me interested, so I went and disassembled BlockMove. Surprisingly,
- they aren't using doubles:
-
- BlockMove
- +00060 40A1C558 lwz r5,0x0000(r3)
- +00064 40A1C55C lwz r6,0x0004(r3)
- +00068 40A1C560 lwz r7,0x0008(r3)
- +0006C 40A1C564 lwz r8,0x000C(r3)
- +00070 40A1C568 lwz r9,0x0010(r3)
- +00074 40A1C56C lwz r10,0x0014(r3)
- +00078 40A1C570 lwz r11,0x0018(r3)
- +0007C 40A1C574 lwz r12,0x001C(r3)
- +00080 40A1C578 dcbz 0,r4
- +00084 40A1C57C addi r3,r3,0x0020
- +00088 40A1C580 dcbt 0,r3
- +0008C 40A1C584 stw r5,0x0000(r4)
- +00090 40A1C588 stw r6,0x0004(r4)
- +00094 40A1C58C stw r7,0x0008(r4)
- +00098 40A1C590 stw r8,0x000C(r4)
- +0009C 40A1C594 stw r9,0x0010(r4)
- +000A0 40A1C598 stw r10,0x0014(r4)
- +000A4 40A1C59C stw r11,0x0018(r4)
- +000A8 40A1C5A0 stw r12,0x001C(r4)
- +000AC 40A1C5A4 addi r4,r4,0x0020
- +000B0 40A1C5A8 bdnz BlockMove+00060
-
-
- The performance win is in the dcbz/dcbt pair. I'm assuming you weren't
- copying to video memory, because that's marked uncacheable, and dcbz will
- severely hurt performance if your destination is uncacheable.
-
- I probably would have written it more like this, personally. Does anyone
- have any idea what makes Apple's better? (Assuming it is...)
-
- ;assume source, destination, and size are all 32-byte aligned
- ;set r3 to source address minus 8 and r4 to destination address minus 8
- ;set ctr to size >> 5
-
- BlockMoveLoop
- lfd fp0,8(r3)
- lfd fp1,16(r3)
- lfd fp2,24(r3)
- lfdu fp3,32(r3)
- dcbz 0,r4
- dcbt 0,r3
- stfd fp0,8(r4)
- stfd fp1,16(r4)
- stfd fp2,24(r4)
- stfdu fp3,32(r4)
- bdnz BlockMoveLoop
-
- > I compared BlockMove vs BlockMoveData and found no difference at all (both
- > 146 ticks). Does BlockMove not flush the cache on a 6100?
- >
-
- With the 601's unified instruction and data cache, treating code as
- data causes no problems, so there's no coherency to maintain between
- the two caches. In other words, BlockMove shouldn't flush anything on
- the 601, but on the 604 it would need to.
-
- --
- _________________________________________________________________________
- andrew meggs the one who dies with the most
- meggs@virginia.edu AOL free trial disks wins
- _________________________________________________________________________
- dead tv software --==-- the next generation of 3D games for the macintosh
- <http://darwin.clas.virginia.edu/~apm3g/deadtv/index.html>
-
- +++++++++++++++++++++++++++
-
- >From Mark Williams <Mark@streetly.demon.co.uk>
- Date: Wed, 22 Nov 95 09:42:32 GMT
- Organization: Streetly Software
-
-
- In article <meggs-2011952155080001@bootp-188-82.bootp.virginia.edu>, Andrew Meggs writes:
-
- >
- > In article <erichsen-1511951722510001@pm2-3.pacificnet.net>,
- > erichsen@pacificnet.net (Erichsen) wrote:
- >
- > > I did some tests (modifying the code in MoveData app from Tricks of the
- > > Mac Game Programming Gurus) between using doubles in a loop and BlockMove
- > > in a loop and BlockMove still blew it away (200 ticks vs 146 ticks for
- > > BlockMove) so why don't more people use BlockMove?
- > >
- >
- > This got me interested, so I went and disassembled BlockMove. Surprisingly,
- > they aren't using doubles:
- >
- > BlockMove
- > +00060 40A1C558 lwz r5,0x0000(r3)
- > +00064 40A1C55C lwz r6,0x0004(r3)
- > +00068 40A1C560 lwz r7,0x0008(r3)
- > +0006C 40A1C564 lwz r8,0x000C(r3)
- > +00070 40A1C568 lwz r9,0x0010(r3)
- > +00074 40A1C56C lwz r10,0x0014(r3)
- > +00078 40A1C570 lwz r11,0x0018(r3)
- > +0007C 40A1C574 lwz r12,0x001C(r3)
- > +00080 40A1C578 dcbz 0,r4
- > +00084 40A1C57C addi r3,r3,0x0020
- > +00088 40A1C580 dcbt 0,r3
- > +0008C 40A1C584 stw r5,0x0000(r4)
- > +00090 40A1C588 stw r6,0x0004(r4)
- > +00094 40A1C58C stw r7,0x0008(r4)
- > +00098 40A1C590 stw r8,0x000C(r4)
- > +0009C 40A1C594 stw r9,0x0010(r4)
- > +000A0 40A1C598 stw r10,0x0014(r4)
- > +000A4 40A1C59C stw r11,0x0018(r4)
- > +000A8 40A1C5A0 stw r12,0x001C(r4)
- > +000AC 40A1C5A4 addi r4,r4,0x0020
- > +000B0 40A1C5A8 bdnz BlockMove+00060
- >
- >
- > The performance win is in the dcbz/dcbt pair. I'm assuming you weren't
- > copying to video memory, because that's marked uncacheable, and dcbz will
- > severely hurt performance if your destination is uncacheable.
- >
- > I probably would have written it more like this, personally. Does anyone
- > have any idea what makes Apple's better? (Assuming it is...)
-
- Consecutive stfd's stall both pipelines on the 601. This means that
- (assuming all cache hits) you get one fp store every 3 cycles, compared
- with one integer store every cycle. The result is 12 cycles to transfer
- 4 doublewords using fp registers, but only 10 cycles to move the same
- 32 bytes using integer registers (see page I-175 of the 601 User's
- Manual).
-
- > ;assume source, destination, and size are all 32-byte aligned
- > ;set r3 to source address minus 8 and r4 to destination address minus 8
- > ;set ctr to size >> 5
- >
- > BlockMoveLoop
- > lfd fp0,8(r3)
- > lfd fp1,16(r3)
- > lfd fp2,24(r3)
- > lfdu fp3,32(r3)
- > dcbz 0,r4
- > dcbt 0,r3
- > stfd fp0,8(r4)
- > stfd fp1,16(r4)
- > stfd fp2,24(r4)
- > stfdu fp3,32(r4)
- > bdnz BlockMoveLoop
- >
-
- One other problem with your code (and presumably why Apple uses the
- apparently wasteful addi instructions rather than load/store with
- update) is that your dcbt instruction comes too late... fp3 already
- contains the double at r3 by the time you hit the dcbt 0,r3
- instruction, so it has no effect. Much worse, the dcbz always touches
- the block you wrote the _previous_ time through the loop...
-
- This could easily be fixed by preloading r5 with 8 and writing
-
- dcbz r5,r4
- dcbt r5,r3
-
- But you would still lose out on a 601. I _think_ it would be quicker on
- a 604, but I've not checked.
- - --------------------------------------
- Mark Williams<Mark@streetly.demon.co.uk>
-
- +++++++++++++++++++++++++++
-
- >From cameron_esfahani@powertalk.apple.com (Cameron Esfahani)
- Date: Tue, 28 Nov 1995 01:24:06 -0800
- Organization: Apple Computer, Inc.
-
- BlockMoveData was introduced with System 7.5. The code for
- it was kicking around Apple for a little while before we had a shipping
- vehicle for it.
-
- Cameron Esfahani
-
- +++++++++++++++++++++++++++
-
- >From deirdre@deeny.mv.com (Deirdre)
- Date: Tue, 28 Nov 1995 14:46:04 GMT
- Organization: Tarla's Secret Clench
-
- BlockMove was available in System 1.0. However, the distinction between
- BlockMove and the newer call BlockMoveData is only significant on 040s and
- higher. On other machines it is the same trap.
-
- _Deirdre
-
- +++++++++++++++++++++++++++
-
- >From kenp@nmrfam.wisc.edu (Ken Prehoda)
- Date: Wed, 29 Nov 1995 09:26:05 -0600
- Organization: Univ of Wisconsin-Madison, Dept of Biochemistry
-
- As far as I can tell BlockMoveData is _only_ significant on the 040.
- BlockMove does not flush the cache on the PPC's.
- _____________________________________________________________________________
- Ken Prehoda kenp@nmrfam.wisc.edu
- Department of Biochemistry http://www.nmrfam.wisc.edu
- University of Wisconsin-Madison Tel: 608-263-9498
- 420 Henry Mall Fax: 608-262-3453
-
- +++++++++++++++++++++++++++
-
- >From cameron_esfahani@powertalk.apple.com (Cameron Esfahani)
- Date: Wed, 29 Nov 1995 22:53:41 -0800
- Organization: Apple Computer, Inc.
-
- > As far as I can tell BlockMoveData is _only_ significant on the 040.
- > BlockMove does not flush the cache on the PPC's.
-
- That is not true. BlockMove does flush the cache on the newer PPCs.
- Any PPC with a split cache (the 603/604 and any others) will require
- cache flushing. So BlockMove on a 601-based machine doesn't flush the
- cache, because there it makes no sense, but on post-601 machines it
- does flush.
-
- Cameron Esfahani
-
- +++++++++++++++++++++++++++
-
- >From mick@emf.net (Mick Foley)
- Date: Wed, 29 Nov 1995 22:23:29 -0800
- Organization: "emf.net" Quality Internet Access. (510) 704-2929 (Voice)
-
- > As far as I can tell BlockMoveData is _only_ significant on the 040.
- > BlockMove does not flush the cache on the PPC's.
-
- Not on the 601, which has a unified cache. But it should make a big
- difference on the 603 and 604, which have split data and code caches.
-
- Mick
-
- +++++++++++++++++++++++++++
-
- >From Ed Wynne <arwyn@engin.umich.edu>
- Date: 4 Dec 1995 04:09:26 GMT
- Organization: Arwyn, Inc.
-
- Actually, that's almost right... BlockMoveData CAN cause cache flushing
- on 601-based machines if they are running the DR emulator. The
- processor cache doesn't get flushed, but the emulator's internal cache
- of recompiled code does. This process is probably a fair amount slower
- than the real on-chip cache flush, since it is a software-based
- operation.
-
- To my knowledge the only machines so far with this configuration would
- be the 7200 and 7500. (Does the 8500 have a 601 option?)
-
- -ed
-
- ---------------------------
-
- C.S.M.P. Digest Tue, 26 Dec 95 Volume 3 : Issue 129
-
- ---------------------------
-
- >From steele@isi.edu (Craig S. Steele)
- Subject: Block copy on 604 slow
- Date: Tue, 5 Dec 1995 18:30:53 -0800
- Organization: USC Information Sciences Institute
-
- I'm trying to benchmark block copy rates of various sizes for PowerPCs. My
- results are disappointing for the 604, and cause me to wonder what it is I
- don't understand. Testing on a 9500/120, to which I have limited access, gives
- the following results for copy code using 32-bit integer and 64-bit double
- loads and stores, respectively:
-
- Asm lvector copy of 1024 doubles in 44.8 nS/acc, 5.4 clocks/acc, 85.1 MB/s
- Asm dvector copy of 1024 doubles in 34.1 nS/acc, 4.1 clocks/acc, 111.9 MB/s
-
- The source array is aligned to 4K, the destination array to 4K+0x100, to avoid
- possible aliasing interlocks. The source array is preloaded immediately
- before the copy routine is called, so I would expect everything to run at L1
- cache rates.
-
- I would naively expect the copy code to average about 1.5 clocks per load or
- store. Instead, my code reports over 4 clocks/access. The code uses the
- time-base register for timing, which shouldn't cause significant cache
- disturbance.
-
- Can anyone contradict, corroborate, or explain my poor results? If I can't do
- better than this, we'll have to build extra hardware :-(
- Thanks in advance.
- -Craig
-
- exportf2 dvec_copy
- mtctr r5 ; init loop counter
- addi r3,r3,-8 ; predecrement pointer by double size
- addi r4,r4,-8 ; predecrement pointer by double size
- li r6,8 ; cache line alignment constant for dcbz
- b dvc_1
- align 6
- dvc_1
- dcbz r6,r3 ; kill dest. cache line
- lfd fp0,8(r4) ; load four doubles from the source line
- lfd fp1,16(r4)
- lfd fp2,24(r4)
- lfdu fp3,32(r4) ; last load bumps the source pointer a full line
- stfd fp0,8(r3) ; store them to the just-zeroed dest line
- stfd fp1,16(r3)
- stfd fp2,24(r3)
- stfdu fp3,32(r3) ; last store bumps the dest pointer a full line
- bdnz dvc_1 ; test loop condition
- blr
-
-
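- For context, a C harness around a routine like this might look roughly
- as follows (a sketch, not Craig's actual code; ReadTB() is a
- hypothetical stand-in for reading the PPC timebase, the 4K alignment
- fiddling is elided, and the tick-to-nanosecond scale is
- machine-dependent):
-
-   #include <stdio.h>
-
-   extern unsigned long ReadTB(void);   /* hypothetical timebase read */
-   extern void dvec_copy(double *dst, const double *src, long lines);
-
-   #define NDOUBLES 1024L
-
-   static double srcBuf[NDOUBLES], dstBuf[NDOUBLES];
-
-   static void TimeCopy(double nsPerTick)
-   {
-       unsigned long   t0, t1;
-       volatile double sink = 0.0;
-       long            i;
-
-       for (i = 0; i < NDOUBLES; i++)   /* preload source into L1 */
-           sink += srcBuf[i];
-
-       t0 = ReadTB();
-       dvec_copy(dstBuf, srcBuf, NDOUBLES / 4);  /* 4 doubles/iter */
-       t1 = ReadTB();
-
-       /* one load + one store per double = 2*NDOUBLES accesses */
-       printf("%.1f ns/acc\n",
-              (t1 - t0) * nsPerTick / (2.0 * NDOUBLES));
-   }
-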
-
- Craig S. Steele - Not yet Institutionalized
-
-
-
- +++++++++++++++++++++++++++
-
- >From rbarris@netcom.com (Robert Barris)
- Date: Wed, 6 Dec 1995 09:46:47 GMT
- Organization: NETCOM On-line Communication Services (408 261-4700 guest)
-
- In article <9512051830.AA53505@kandor.isi.edu>,
- Craig S. Steele <steele@isi.edu> wrote:
- >I'm trying to benchmark block copy rates of various sizes for PowerPCs. My
- >results are disappointing for the 604, and cause me to wonder what it is I
- >don't understand. Testing on a 9500/120, to which I have limited access, gives
- >the following results for copy code using 32-bit integer and 64-bit double
- >load and stores, respectively:
- >
- >Asm lvector copy of 1024 doubles in 44.8 nS/acc, 5.4 clocks/acc, 85.1 MB/s
- >Asm dvector copy of 1024 doubles in 34.1 nS/acc, 4.1 clocks/acc, 111.9 MB/s
-
- OK, in regular "bytes", you appear to be moving (for example's sake)
- 8192 bytes
- from address (say) 0x1000000
- to address (say) 0x1002100.
-
- So you are reading 8K and writing 8K as I read it... in a perfect world
- all of your data would fit (precisely) into the L1 d cache.
-
- >The source array is aligned to 4K, the destination array to 4K+0x100, to avoid
- >possible aliasing interlocks. The source array is preloaded immediately
- >before the copy routine is called, so I would expect everything to run at L1
- >cache rates.
-
- Except that you are sharing that L1 with things like interrupt tasks, 68K
- interrupt tasks (which invoke the emulator causing additional pollution),
- and so on.
-
- Since as far as I know, there is no way to completely shut off PowerPC
- interrupts, quantifying the effect of background processes on your cache
- population can be a bit tricky.
-
- >I would naively expect the copy code to average about 1.5 clocks per load or
- >store. Instead, my code reports over 4 clocks/access. The code uses the
- >time-base register for timing, which shouldn't cause significant cache
- >disturbance.
-
- When you say per access, do you mean per double "moved" as in a read and
- a write, or per double accessed, as in the read or the write alone?
-
- I guess I can work it out: 110 MB/s (say it's 120 for argument's sake)
- is about 1 MB per million clocks (at 120 MHz), or about a byte moved per
- clock, or a double moved per 8 clocks. OK, so that's 4 per double read,
- 4 per double write (on average).
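-
- That back-of-envelope conversion is worth writing down (a tiny sketch,
- nothing more):
-
-   /* At mbPerSec MB/s on a cpuHz-clock machine: cycles spent per byte
-      moved, hence 8x that per 8-byte double read or written. */
-   double CyclesPerByte(double mbPerSec, double cpuHz)
-   {
-       return cpuHz / (mbPerSec * 1e6);
-   }
-   /* 120 MB/s at 120 MHz -> 1 cycle/byte = 8 cycles per double moved,
-      i.e. ~4 for the read + ~4 for the write. */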
-
- Suggestions:
- 1. Plot speed versus vector length. Look for nonlinearities.
- (deliberately shrink or grow the vector).
-
- 2. wiggle that 256 byte offset factor some more. or make it zero.
- I do not think the 4-wayness would become a problem until you went
- above 8K vectors, then very little would help...
-
- 3. think about cache hinting at or near the bottom of the loop.
- if for some reason a cache line which you are going to read from
- has been dropped, it's good to schedule its re-fetch as far ahead as
- possible. I'm sure Tim Olson can elaborate much more better good :)
-
- 4. I hear Exponential Technology has a faster BiCMOS 604 coming...
-
- Rob Barris
- Quicksilver Software Inc.
- rbarris@quicksilver.com
- * opinions expressed not necessarily those of my employer *
-
- +++++++++++++++++++++++++++
-
- >From steele@isi.edu (Craig S. Steele)
- Date: Wed, 6 Dec 1995 12:41:15 -0800
- Organization: USC Information Sciences Institute
-
- In article <rbarrisDJ5sHz.MJy@netcom.com>, rbarris@netcom.com (Robert Barris)
- writes:
- > In article <9512051830.AA53505@kandor.isi.edu>, Craig S. Steele
- > <steele@isi.edu> wrote:
- > >I'm trying to benchmark block copy rates of various sizes for
- > >PowerPCs. My results are disappointing for the 604, and cause me
- > >to wonder what it is I don't understand. Testing on a 9500/120, to
- > >which I have limited access, gives the following results for copy
- > >code using 32-bit integer and 64-bit double load and stores,
- > >respectively:
- > >
- > >Asm lvector copy of 1024 doubles in 44.8 nS/acc, 5.4 clocks/acc, 85.1 MB/s
- > >Asm dvector copy of 1024 doubles in 34.1 nS/acc, 4.1 clocks/acc, 111.9 MB/s
-
- > So you are reading 8K and writing 8K as I read it... in a perfect
- > world all of your data would fit (precisely) into the L1 d cache.
- Exactly. However, I did benchmark a range of copy sizes from 512B to 1MB; the
- quoted 8KB block results were the fastest. Needless to say the rate for
- larger blocks dropped precipitously as the sizes busted (burst?) the L1 and L2
- caches.
-
- > >...so I would expect everything to run at L1 cache rates.
- > Except that you are sharing that L1 with things like interrupt
- > tasks, 68K interrupt tasks (which invoke the emulator causing
- > additional pollution), and so on.
- True. I would have thought that at least some of my trials would have fit in
- between interrupts, e.g., the critical part of the 8KB case only takes about
- 100uS, and the smaller proportionately less. I also tried back-to-back copy
- calls, producing essentially identical results. I did get _much_ worse results
- when I experimented with using the MacOS Microseconds call for timing, so the
- cache pollution issue is very real. What is the highest-rate interrupt
- source on an idle PowerMac anyway? Is Microseconds non-native? I'm
- clueless.
-
- > Since as far as I know, there is no way to completely shut off
- > PowerPC interrupts, quantifying the effect of background processes on
- > your cache population can be a bit tricky.
- I believe I know how to do it on an 8100 (although not the 9500), so
- it's probably worth a (probable deathcookies) experiment to see if it
- makes a difference there. I deeply regret having blown up our only
- hardware prototype
- last month... Maybe next week I'll have a bare machine again, knock on
- Formica(TM).
-
- > >I would naively expect the copy code to average about 1.5 clocks per
- > >load or store. Instead, my code reports over 4 clocks/access.
- > I guess I can work it out ... OK so that's 4 per double
- > read, 4 per double write (on average).
- Yes.
-
- > Suggestions:
- > 1. Plot speed versus vector length. Look for nonlinearities.
- > (deliberately shrink or grow the vector).
- For a 9500/120:
- 512B 49 MB/s
- 1KB 68
- 2KB 87
- 4KB 109
- 8KB 112
- 16KB 68
- 32KB 62
- 64KB 54
- 128KB 53
- 256KB 40
- 512KB 35
- 1024KB 32
-
- The trends are reasonable; it's just the L1 peak rate that seems very
- low to me. The 6100 and 8100, on the other hand, have some huge
- anomalous dips for 128KB operations, presumably managing to evict the
- code from both the L1 & L2 unified caches in some particularly malign
- way.
-
- > 2. wiggle that 256 byte offset factor some more. or make it zero.
- Zero makes things about 10% slower, but I haven't yet tried other offsets.
-
- > 3. think about cache hinting at or near the bottom of the loop.
- > if for some reason a cache line which you are going to read from
- > has been dropped, it's good to schedule its re-fetch as far ahead
- > as possible.
- A prior load loop is supposed to have ensured that the source is in the cache,
- but this is a good suggestion to double check that assumption, and probably
- the right thing to do for a general-purpose copy where cache status is
- uncontrolled. I'll check this out.
-
- > 4. I hear Exponential Technology has a faster BiCMOS 604 coming...
- That certainly does look interesting, "only" $14 million capitalization,
- but good credentials. Unfortunately, I have to put something under the
- tree for this Christmas; can't wait for that rosy glow ("Is it Rudolph
- or is it bipolar?") we might see next. :-)
-
-
-
- Craig S. Steele - Not yet Institutionalized
-
-
-
- +++++++++++++++++++++++++++
-
- >From tim@apple.com (Tim Olson)
- Date: 7 Dec 1995 03:33:26 GMT
- Organization: Apple Computer, Inc. / Somerset
-
- In article <9512051830.AA53505@kandor.isi.edu>
- steele@isi.edu (Craig S. Steele) writes:
-
- > I would naively expect the copy code to average about 1.5 clocks per load or
- > store. Instead, my code reports over 4 clocks/access. The code uses the
- > time-base register for timing, which shouldn't cause significant cache
- > disturbance.
- >
- > Can anyone contradict, corroborate, or explain my poor results?
-
- I did a number of measurements a while back which showed that a 604 can
- perform the loop you gave (without the DCBZ) at about 1.3 cycles per
- doubleword loaded or stored -- this was done by measuring the runtime
- of copying a 64-byte block over many iterations, so both source and
- destination were in the cache. The DCBZ instruction spends multiple
- cycles clearing the allocated cache block, so that will add some
- overhead (I don't have my spec with me -- I seem to remember it is 4
- cycles), which should bring it to somewhere around 15 cycles per loop
- iteration, or about 1.8 cycles per doubleword, which is still far less
- than your reported 4 cycles.
-
- First, try running without the DCBZ to see if it more closely matches
- my results (~1.3 cycles per doubleword); if not, then you might be
- forgetting about some multiplication factor when using the timebase
- register. On the 604, it increments every 4th bus clock.
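-
- Spelled out as code, that conversion looks like this (a sketch; the
- 120 MHz CPU and 40 MHz bus figures are only assumptions for a
- 9500/120):
-
-   #define CPU_HZ            120e6   /* assumed CPU clock */
-   #define BUS_HZ             40e6   /* assumed bus clock */
-   #define BUSCLKS_PER_TICK    4.0   /* 604 timebase: every 4th bus clock */
-
-   double TimebaseTicksToCPUCycles(unsigned long ticks)
-   {
-       return ticks * BUSCLKS_PER_TICK * (CPU_HZ / BUS_HZ);
-   }
-
-   double TimebaseTicksToNanoseconds(unsigned long ticks)
-   {
-       return ticks * BUSCLKS_PER_TICK * 1e9 / BUS_HZ;
-   }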
-
- -- Tim Olson
- Apple Computer, Inc. / Somerset
- tim@apple.com
-
- +++++++++++++++++++++++++++
-
- >From cliffc@ami.sps.mot.com (Cliff Click)
- Date: 7 Dec 95 09:23:08
- Organization: none
-
- steele@isi.edu (Craig S. Steele) writes:
-
- Craig S. Steele <steele@isi.edu> wrote:
- >I'm trying to benchmark block copy rates of various sizes for
- >PowerPCs. My results are disappointing for the 604, and cause me
- >to wonder what it is I don't understand. Testing on a 9500/120, to
- >which I have limited access, gives the following results for copy
- >code using 32-bit integer and 64-bit double load and stores,
- >respectively:
-
- Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
- My 604 book sez these are #regs+2 cycles each, whilst the float
- operations are 3 cycles each. For large enough blocks, you should
- win on the lmw and stmw.
-
- Cliff
- --
- Cliff Click Compiler Researcher & Designer
- RISC Software, Motorola PowerPC Compilers
- cliffc@risc.sps.mot.com (512) 891-7240
-
- +++++++++++++++++++++++++++
-
- >From tim@apple.com (Tim Olson)
- Date: 8 Dec 1995 02:59:57 GMT
- Organization: Apple Computer, Inc. / Somerset
-
- In article <CLIFFC.95Dec7092308@ami.sps.mot.com>
- cliffc@ami.sps.mot.com (Cliff Click) writes:
-
- > Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
- > My 604 book sez these are #regs+2 cycles each, whilst the float
- > operations are 3 cycles each. For large enough blocks, you should
- > win on the lmw and stmw.
-
- The lfd instruction has a 3-cycle latency for using the result of the
- load in a floating-point operation, but the issue-rate of lfd is one
- per cycle. When pipelined in the manner used in the block copy code,
- it can transfer at close to one doubleword per cycle.
-
- Load and store multiple instructions can achieve close to one word per
- cycle for large transfers, but that is half the bandwidth of the
- lfd/stfd solution.
-
-
- -- Tim Olson
- Apple Computer, Inc. / Somerset
- tim@apple.com
-
- +++++++++++++++++++++++++++
-
- >From Mark Williams <Mark@streetly.demon.co.uk>
- Date: Thu, 07 Dec 95 18:25:26 GMT
- Organization: Streetly Software
-
-
- In article <CLIFFC.95Dec7092308@ami.sps.mot.com>, Cliff Click writes:
-
- >
- > steele@isi.edu (Craig S. Steele) writes:
- >
- > Craig S. Steele <steele@isi.edu> wrote:
- > >I'm trying to benchmark block copy rates of various sizes for
- > >PowerPCs. My results are disappointing for the 604, and cause me
- > >to wonder what it is I don't understand. Testing on a 9500/120, to
- > >which I have limited access, gives the following results for copy
- > >code using 32-bit integer and 64-bit double load and stores,
- > >respectively:
- >
- > Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
- > My 604 book sez these are #regs+2 cycles each, whilst the float
- > operations are 3 cycles each. For large enough blocks, you should
- > win on the lmw and stmw.
- >
- > Cliff
- > --
- > Cliff Click Compiler Researcher & Designer
- > RISC Software, Motorola PowerPC Compilers
- > cliffc@risc.sps.mot.com (512) 891-7240
-
- But surely the point is that lfd & stfd have a _latency_ of 3 cycles,
- but a throughput of 1 instruction per cycle, whereas lmw/stmw have both
- a latency and throughput of 1 instruction per #regs+2 cycles. That
- means the lfd/stfd method should be able to move (i.e. load and store)
- 1 word per cycle, while lmw/stmw cannot do better than 1 word every 2
- cycles (and even with 28 regs available it would take 60 cycles to move
- 28 words).
-
- - --------------------------------------
- Mark Williams<Mark@streetly.demon.co.uk>
-
- +++++++++++++++++++++++++++
-
- >From tjrob@bluebird.flw.att.com (Tom Roberts)
- Date: Sat, 9 Dec 1995 19:19:22 GMT
- Organization: AT&T Bell Laboratories
-
- In article <4a89nd$hrp@cerberus.ibmoto.com>, Tim Olson <tim@apple.com> wrote:
- >In article <CLIFFC.95Dec7092308@ami.sps.mot.com>
- >cliffc@ami.sps.mot.com (Cliff Click) writes:
- >
- >> Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
- >> My 604 book sez these are #regs+2 cycles each, whilst the float
- >> operations are 3 cycles each. For large enough blocks, you should
- >> win on the lmw and stmw.
- >
- >The lfd instruction has a 3-cycle latency for using the result of the
- >load in a floating-point operation, but the issue-rate of lfd is one
- >per cycle. When pipelined in the manner used in the block copy code,
- >it can transfer at close to one doubleword per cycle.
- >
- >Load and store multiple instructions can achieve close to one word per
- >cycle for large transfers, but that is half the bandwidth of the
- >lfd/stfd solution.
-
- In practical systems, memory bandwidth is MUCH more important than
- the number of instructions used or their throughput or latency.
- (This assumes that the data actually resides in memory, not just in the
- cache. This also assumes a "long" loop, so the code is in the icache.)
-
- In systems which run the 604 at 1:1 clocking (i.e. internal CPU clock
- equals external bus clock), memory bandwidth can be 2-4 times lower
- than simple calculations would predict. This is due to cache-access
- limitations and the fact that both the CPU and the bus access unit are
- competing for access to the cache. In this mode the memory system
- essentially NEVER overlaps address and data tenures on the bus (halving
- memory bandwidth); there are usually several bus clocks between
- successive cycles, reducing bandwidth even more.
-
- With 1.5:1 clocking this effect is reduced -- the cache can handle
- one access per internal clock, so there is a cycle available to the
- CPU between every 2 bus accesses. At 2:1 this effect should disappear,
- as the CPU can get every other cycle, and keep up with the memory
- bus bandwidth.
-
- Note that only recently have 604 chips been shipping which can go 1.5:1
- at 66 MHz bus clock.
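-
- As a rough illustration of "simple calculations" versus reality
- (illustrative numbers only, assuming a 64-bit data bus moving one
- 8-byte beat per bus clock):
-
-   #include <stdio.h>
-
-   int main(void)
-   {
-       double busHz = 66e6;                /* 66 MHz bus clock */
-       double peak  = busHz * 8.0 / 1e6;   /* 8 bytes per beat, in MB/s */
-
-       printf("theoretical peak: %.0f MB/s\n", peak);
-       printf("no address/data overlap: ~%.0f MB/s or less\n", peak / 2.0);
-       return 0;
-   }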
-
- Tom Roberts tjrob@iexist.att.com
-
- ---------------------------
-